College Scorecard Analysis by Justin “Roy” Garrard

The data explored in this report comes from College Scorecard, a product of the U.S. Department of Education that contains college statistics from 1996 to 2014. This analysis looks only at four-year universities in the 2014 data.

Univariate Plots Section

##      STATE                  FUNDING_TYPE      REGION       LATITUDE    
##  CA     : 189   Public            : 554   5      :449   Min.   :13.43  
##  NY     : 164   Private Non-Profit:1213   2      :372   1st Qu.:34.28  
##  PA     : 116   Private For-Profit: 257   3      :301   Median :39.48  
##  TX     : 100                             8      :274   Mean   :37.95  
##  IL     :  83                             4      :193   3rd Qu.:41.79  
##  FL     :  82                             6      :166   Max.   :64.86  
##  (Other):1290                             (Other):269                  
##    LONGITUDE        ADM_RATE_ALL    TUITIONFEE_IN   TUITIONFEE_OUT 
##  Min.   :-157.89   Min.   :0.0000   Min.   : 2019   Min.   : 2475  
##  1st Qu.: -96.65   1st Qu.:0.5491   1st Qu.: 8633   1st Qu.:14682  
##  Median : -85.18   Median :0.6871   Median :15024   Median :20868  
##  Mean   : -89.39   Mean   :0.6663   Mean   :19222   Mean   :22650  
##  3rd Qu.: -77.00   3rd Qu.:0.7935   3rd Qu.:28478   3rd Qu.:29526  
##  Max.   : 144.80   Max.   :1.0000   Max.   :51008   Max.   :51008  
##                    NA's   :623      NA's   :377     NA's   :377    
##  UNDERGRAD_ENROLL    RETENTION      GRAD_DEBT_MDN   WDRAW_DEBT_MDN 
##  Min.   :    0.0   Min.   :0.0000   Min.   : 2100   Min.   : 2113  
##  1st Qu.:  820.5   1st Qu.:0.6647   1st Qu.:21000   1st Qu.: 8125  
##  Median : 2052.0   Median :0.7546   Median :24750   Median : 9500  
##  Mean   : 4952.4   Mean   :0.7274   Mean   :23781   Mean   : 9842  
##  3rd Qu.: 5652.5   3rd Qu.:0.8347   3rd Qu.:27000   3rd Qu.:11118  
##  Max.   :52280.0   Max.   :1.0000   Max.   :49750   Max.   :30250  
##  NA's   :281       NA's   :436      NA's   :356     NA's   :357    
##  COMPLETION_FIVE_YRS
##  Min.   :0.0000     
##  1st Qu.:0.3689     
##  Median :0.5026     
##  Mean   :0.5050     
##  3rd Qu.:0.6478     
##  Max.   :1.0000     
##  NA's   :430

There are roughly 2000 points of data, each of which was constrained to fifteen variables (name and ID number were excluded from the summary).

The dataset shows that there are approximately four times as many universities east of the -100 longitude line as there are west of it. One interesting note is that the heat map of university placement closely matches NASA light pollution images, implying that universities are concentrated in urban areas.
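The east/west comparison amounts to counting rows on either side of a longitude threshold. The report's own code is not shown, so the following is only a minimal Python sketch; it assumes each record is a dict with a `LONGITUDE` key, and the sample rows are invented for illustration rather than taken from the dataset.

```python
def east_west_counts(rows, boundary=-100.0):
    """Count rows east and west of a longitude boundary, skipping missing values."""
    east = west = 0
    for row in rows:
        lon = row.get("LONGITUDE")
        if lon is None:
            continue  # rows without coordinates are ignored
        if lon > boundary:
            east += 1
        else:
            west += 1
    return east, west

# Hypothetical sample rows (not real College Scorecard records)
sample = [
    {"LONGITUDE": -77.0},
    {"LONGITUDE": -85.2},
    {"LONGITUDE": -96.7},
    {"LONGITUDE": -118.2},
    {"LONGITUDE": None},
]
east, west = east_west_counts(sample)
print(east, west)  # prints "3 1"
```

One caveat with a plain threshold: a positive longitude such as the maximum of 144.80 in the summary (presumably a Pacific territory) would be counted as "east" unless such rows are filtered out first.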

The dataset helpfully breaks up location information into nine regions. Excluding the ninth region (which covers U.S. Territories like Puerto Rico), the fewest universities are found in region seven. A more visual representation of this data can be found in the Bivariate Plots section.

California leads in universities per state, with New York following a reasonable distance behind. Puerto Rico manages to surpass several states, while many of the other territories lack even a single university.

## [1] "In-State Tuition Summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2019    8633   15020   19220   28480   51010     377
## [1] "Out-of-State Tuition Summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2475   14680   20870   22650   29530   51010     377

The median In-State Tuition is slightly above 15,000. The median Out-of-State Tuition is close to 21,000.

## [1] "Graduate Debt Summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2100   21000   24750   23780   27000   49750     356
## [1] "Dropout Debt Summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2113    8125    9500    9842   11120   30250     357

Predictably, graduating students have significantly higher median debt levels than those who withdraw from university. Also of note is how much more tightly clustered dropout debt is than graduate debt in dollar terms: its interquartile range is about $3,000, versus $6,000 for graduates.
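The difference in spread can be checked directly from the quartiles printed in the two summaries above. This is a small Python sketch for illustration (the report's own code is not shown); the only inputs are numbers copied from that output.

```python
# Quartiles and medians copied from the summary output above
grad_q1, grad_median, grad_q3 = 21000, 24750, 27000
wdraw_q1, wdraw_median, wdraw_q3 = 8125, 9500, 11120

# Interquartile range: the width of the middle half of each distribution
grad_iqr = grad_q3 - grad_q1
wdraw_iqr = wdraw_q3 - wdraw_q1

# The same spread expressed as a fraction of each median
grad_relative = grad_iqr / grad_median
wdraw_relative = wdraw_iqr / wdraw_median

print(grad_iqr, wdraw_iqr)  # prints "6000 2995"
```

In dollar terms, the middle half of dropout debt spans roughly half the range of graduate debt. Relative to each median, though, the dropout spread is somewhat larger (about 0.32 versus 0.24), so the tighter clustering holds in absolute rather than proportional terms.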

## [1] "Undergraduate Enrollment Summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   820.5  2052.0  4952.0  5652.0 52280.0     281

Enrollment varies wildly, which is to be expected given the wide range of university sizes. One notable outlier is the University of Phoenix, the only university to report a six-figure enrollment (151,600). This data point has been excluded from most of the analysis in this report.

## [1] "Undergraduate Five-Year Completion Summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.3689  0.5026  0.5050  0.6478  1.0000     430

Graduation rates follow a surprisingly normal distribution. There is a worrisome number of universities (~50) whose completion rates fall below 0.1. These values are distinct from NA, so it would seem that they were deliberately reported as such. Of note are their funding types (rarely public) and admission rates (generally an NA value).
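The distinction between a reported rate of 0.0 and a missing value matters when picking out these schools. A minimal Python sketch, assuming missing values are represented as `None` and each record is a dict; the `NAME` key and the sample rows are hypothetical.

```python
def low_completion(rows, threshold=0.1):
    """Return rows with a reported completion rate below the threshold.

    Rows whose rate is missing (None) are skipped rather than treated as zero.
    """
    return [
        row for row in rows
        if row["COMPLETION_FIVE_YRS"] is not None
        and row["COMPLETION_FIVE_YRS"] < threshold
    ]

# Hypothetical sample rows (not real College Scorecard records)
sample = [
    {"NAME": "A", "COMPLETION_FIVE_YRS": 0.0},
    {"NAME": "B", "COMPLETION_FIVE_YRS": None},
    {"NAME": "C", "COMPLETION_FIVE_YRS": 0.55},
    {"NAME": "D", "COMPLETION_FIVE_YRS": 0.08},
]
flagged = [row["NAME"] for row in low_completion(sample)]
print(flagged)  # prints "['A', 'D']"
```

School B is left out because its rate is unknown, not because it is high; only A and D have reported rates below 0.1.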

## [1] "Admission Rates Summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.5491  0.6871  0.6663  0.7935  1.0000     623

Admission rates trend towards accepting more often than rejecting, but I imagine that this varies with other conditions (tuition, university funding type, etc.).

## [1] "Retention Rates Summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.6647  0.7546  0.7274  0.8347  1.0000     436

There is a concerning number of 0.0 and 0.01 retention rates. As before though, the data draws a distinction between 0.0 and NA, so these were likely reported as such. Again, these points are similar in that they’re not public universities and rarely have a listed admission rate.

Private Non-Profit schools make up a decided majority of the data points.
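The size of that majority can be worked out from the FUNDING_TYPE counts in the summary table at the top of this section. A short Python sketch for illustration, using only those three counts:

```python
# Counts copied from the FUNDING_TYPE column of the summary table
counts = {
    "Public": 554,
    "Private Non-Profit": 1213,
    "Private For-Profit": 257,
}

total = sum(counts.values())

# Share of each funding type as a percentage, rounded to one decimal place
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)   # prints "2024"
print(shares)  # prints "{'Public': 27.4, 'Private Non-Profit': 59.9, 'Private For-Profit': 12.7}"
```

Private Non-Profit schools therefore account for roughly 60% of the records, a little more than twice the Public share.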

Univariate Analysis

What is the structure of your dataset?

There are roughly 2000 points of data, each of which was constrained to fifteen variables. The variables themselves can be grouped into four categories:

  • Location (latitude, longitude, state, region)
  • Finance (tuitionfee_in, tuitionfee_out, grad_debt_mdn, wdraw_debt_mdn)
  • Admission (adm_rate_all, undergrad_enroll, retention, completion_five_yrs)
  • Identification (unitid, name, funding_type)

What is/are the main feature(s) of interest in your dataset?

The primary feature of interest is the completion rate, which offers a quantitative view of a university’s effectiveness. A low completion rate might imply exclusivity, but it also translates into students with debt and nothing to show for it.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I anticipate that the funding type (public, private) and exclusivity (admission rate) will be strong indicators. Other qualities, such as location and retention, may also shed some light on the situation.

Did you create any new variables from existing variables in the dataset?

No, though it may be handy to have some form of “success quotient” relating factors such as completion rate and low tuition.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

  • University enrollment was adjusted for easier viewing, as there is a significant spread in the number of students enrolled.

  • Five year completion rates had an unusually normal distribution.

  • The ninth region (U.S. territories) and Alaska were excluded from the longitude/latitude scatter plot. This was done for easier viewing.

  • The University of Phoenix was removed from the dataset, as its enrollment was more than 70 times the median. Its many sub-schools were kept.
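The "more than 70 times the median" figure can be reproduced from numbers already given in this report. A brief Python check, using the reported University of Phoenix enrollment and the median from the enrollment summary:

```python
phoenix_enrollment = 151600   # reported enrollment quoted earlier in the report
median_enrollment = 2052.0    # median from the undergraduate enrollment summary

# How many times larger the outlier is than a typical university
ratio = phoenix_enrollment / median_enrollment

print(round(ratio, 1))  # prints "73.9"
```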

Bivariate Plots Section

Other than in-state and out-of-state tuition, there are no obvious correlations. There are a few promising leads though, including tuition/completion rate and retention/completion rate.

Adding color to the geographic maps helps to better demonstrate the shape of the regions.

There’s considerable variation between regions with regard to completion rates. The Northeast regions (1, 2, 3) and West Coast region (8) have a median above 0.5. The Northwest (7), Central (4), and South (5, 6) regions fall below 0.5. U.S. Territories (9) are particularly afflicted, with a median completion rate near 0.3.
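A per-region median like the ones compared here is a group-then-summarize operation. The following Python sketch shows one way to do it, assuming records are dicts with `REGION` and `COMPLETION_FIVE_YRS` keys and that missing rates are `None`; the function name and sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_by_group(rows, group_key, value_key):
    """Median of value_key within each group, ignoring missing values."""
    groups = defaultdict(list)
    for row in rows:
        value = row[value_key]
        if value is not None:
            groups[row[group_key]].append(value)
    return {group: median(values) for group, values in groups.items()}

# Hypothetical sample rows (not real College Scorecard records)
sample = [
    {"REGION": 1, "COMPLETION_FIVE_YRS": 0.60},
    {"REGION": 1, "COMPLETION_FIVE_YRS": 0.70},
    {"REGION": 1, "COMPLETION_FIVE_YRS": 0.65},
    {"REGION": 9, "COMPLETION_FIVE_YRS": 0.25},
    {"REGION": 9, "COMPLETION_FIVE_YRS": None},
    {"REGION": 9, "COMPLETION_FIVE_YRS": 0.35},
    {"REGION": 9, "COMPLETION_FIVE_YRS": 0.30},
]
medians = median_by_group(sample, "REGION", "COMPLETION_FIVE_YRS")
print(medians)  # prints "{1: 0.65, 9: 0.3}"
```

The same function would work for any other grouping column, such as funding type, by passing a different `group_key`.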

The funding type of a school is another notable indicator of completion rate. Private For-Profit universities fall far below Public universities in completion rate. Private Non-Profit universities have a noticeable, if not exceptional, advantage over Public schools.

## [1] "Correlation: In-State Tuition and Completion Rate"
## [1] 0.5099584
## [1] "Correlation: Out-of-State Tuition and Completion Rate"
## [1] 0.651668

Out-of-state tuition appears to be a decent indicator of completion rate (correlation of about 0.65). In-state tuition follows a similar pattern, but clustering (likely from tuition subsidies) weakens the trend (about 0.51).
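The source does not say how these correlations were computed. Assuming they are Pearson coefficients taken over rows where both values are present, the calculation can be sketched in Python as follows; the function name and sample values are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)

# Hypothetical sample values (not real College Scorecard records)
tuition = [10000, 20000, None, 30000, 40000]
completion = [0.30, 0.50, 0.40, 0.45, 0.80]
r = pearson(tuition, completion)
print(round(r, 2))
```

Dropping incomplete pairs is a design choice made here because the tuition and completion columns both contain NA values; other ways of handling missing data would give somewhat different coefficients.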

Binning out-of-state tuition into a box-plot makes the trend a little easier to follow.
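Binning here means grouping schools by tuition range so that each range can be drawn as one box. The bin width used for the original plot is not stated; the Python sketch below assumes fixed $5,000 bins, and the function name and sample pairs are hypothetical.

```python
from collections import defaultdict

def bin_values(pairs, width=5000):
    """Group y-values by the fixed-width bin that their x-value falls in.

    Keys are the lower edge of each bin; pairs with a missing value are skipped.
    """
    bins = defaultdict(list)
    for x, y in pairs:
        if x is None or y is None:
            continue
        lower = int(x // width) * width
        bins[lower].append(y)
    return dict(sorted(bins.items()))

# Hypothetical (out-of-state tuition, completion rate) pairs
sample = [
    (12000, 0.40),
    (14999, 0.50),
    (15000, 0.55),
    (31000, 0.80),
    (None, 0.30),
]
binned = bin_values(sample)
print(binned)  # prints "{10000: [0.4, 0.5], 15000: [0.55], 30000: [0.8]}"
```

Each list of completion rates would then feed one box in the plot, which is what makes the upward trend easier to see than in the raw scatter.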

## [1] "Correlation: Retention and Completion Rate"
## [1] 0.6577093

Retention is related to completion rates. This makes sense, since students not retained cannot graduate (though students who are retained may take longer than five years to graduate).

## [1] "Correlation: Admission Rate and Completion Rate"
## [1] -0.2769716
## [1] "Correlation: Enrollment and Completion Rate"
## [1] 0.189623

Neither admission rate nor undergraduate enrollment seems to be more than weakly related to completion rate.

The median debt of graduating students does not appear to have any discernible relationship to completion rate.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Out-of-state tuition, student retention, the funding type of the university, and the region of the university all seem to have some relationship to completion rates. The size of the university and its exclusivity (admission rate) have less bearing. Likewise, the median debt of a graduate seems unrelated to completion rate.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

In-state tuition and out-of-state tuition are correlated, which is to be expected, but the shape of their scatter plots offers some interesting insight into how tuition rates are set. There’s a strong clustering around the $7,000 mark and a break before the $20,000 mark. Following $20,000, in-state tuition looks nearly identical to out-of-state tuition. This shape implies that universities drawing a distinction between in-state and out-of-state tuition have rates that are less than $20,000.

Median graduate debt also exhibits a strange patterning. Hard lines exist at the $25,000 and $27,000 values, utterly independent of completion rate. This suggests a “standard value” of sorts that universities and financial aid packages aim for.

What was the strongest relationship you found?

The funding type of a university has a noticeable relationship to its completion rate. For-profit universities have, on average, lower completion rates than either public or non-profit universities.

Multivariate Plots Section

Adding funding type and region to the plot outlines an interesting divide between universities. For-Profit universities have the lowest completion rates and tuitions. Non-Profit universities make up the majority of high-tuition, high-completion-rate schools. Public universities generally fall in between.

Each region seems to experience this discrepancy in a different way. Region 9 (U.S. Territories) has a surprising number of non-profit universities. Region 8 (West Coast) seems to go against the trend with an odd patterning of public and non-profit schools. Region 5 (Southeast) has so many universities that it’s hard to see a trend.

Categorizing funding types by color gives the plot of retention against completion rate a new story to tell. Public universities show the most consistent relationship, while non-profits are more erratic. For-profit universities, confusingly, show no pattern. That would imply that there is no connection between the number of students that graduate in five years and the number of students that choose to stay each year.

The plot of graduate debt against completion rate had previously been an enigma. Looking through the lens of funding type, however, shows useful patterns. Public and non-profit universities have a fairly similar structure with graduate debt. Non-profits in particular show hard lines at $25,000 and $27,000.

For-profit universities show the highest amount of debt. I find this surprising, given it was previously established that they have lower tuition rates on average.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Tying together out-of-state tuition, university funding type, and region proved a very effective means for understanding completion rates. Independently they have weak relationships to one another, but when drawn together, they show some remarkable patterns.

Were there any interesting or surprising interactions between features?

That for-profit schools have some of the lowest tuitions but highest graduate debts was unexpected (and worrisome). I would be curious to discover why this is the case. Perhaps there are financial aid restrictions in play? Or is it the demographics?

For-profit schools seem strange on the whole. In graduate debt and retention, they follow completely different patterns than other universities. Is this because of differences in student demographic? Or something else entirely?

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

N/A


Final Plots and Summary

Plot One

Description One

This plot simultaneously outlines the relative proportions of universities in the U.S. while providing context for the “Region” variable that’s referenced throughout. Readers can quickly deduce that the majority of universities exist in the Eastern U.S. They can also see the vivid outline of the West Coast and blatant sparsity of the Midwest.

Plot Two

Description Two

There’s a lot going on in this map, but it demonstrates some crucial relationships. First, it shows the relationship between university completion rate and tuition. Some regions demonstrate this more than others, but the overall trend can be observed. Second, it illustrates a national pattern where public, non-profit, and for-profit schools exist on a spectrum. Dividing the map by region reduces the number of outliers viewers have to sift through. Lastly, the chart juxtaposes the various regions with one another. It’s easy to see at a glance that regions 9 and 7 have fewer universities than 5 or 2.

Plot Three

Description Three

This plot was a surprise. It shows the overall pattern of completion rate and tuition fees, but also the effect of in-state tuition. There isn’t a simple left-shift like what one might expect; rather, it’s as though there’s some gravitational force pulling points towards the $5,000 mark.

It’s also interesting to note that the plot is relatively unaffected past the $20,000 mark. Given the similarity in shapes between the two graphs, one can deduce that few schools beyond that point offer different in-state tuitions.


Reflection

The College Scorecard is a vast repository of information across several years. This analysis covered data for four-year universities in the year 2014. I placed an emphasis on fifteen variables in particular, which detailed location, admissions, finances, and identification.

My focus was on investigating the completion rate of various universities. Tuition rates, university funding types, and region were found to be influential. The most significant influence came from the university funding type, as for-profit schools on average have half the completion rate of non-profit schools. Similarly, higher tuition rates seem to go hand-in-hand with higher completion rates.

There was some difficulty in importing, cleaning, and understanding the College Scorecard data. The data itself is massive; my spreadsheet program could not open the .csv because there were simply too many columns. Many of the column names were unintuitive as well (C150_4 comes to mind).

I was pleased with how questions raised by the Bivariate Plots were answered by the Multivariate Plots. Looking at the data through the lens of university funding type showed just how distinct each group is. Future work could investigate the patterns within each funding type and across multiple years.